{ "cells": [ { "cell_type": "markdown", "id": "6d6e66b9-2dcc-4f1d-97ab-0393dbb18afe", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "## 2 - The General Linear Model\n", "### Getting to grips with linear models in Python" ] }, { "cell_type": "markdown", "id": "5981ad66-0254-4287-ae75-a4665c3102de", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "This week we will focus on three things:\n", "- How to do basic, psychology-standard analyses in Python using the `pingouin` package\n", "- How to implement a general linear model in Python with `statsmodels`\n", "- How the two are connected, shown through some examples\n", "\n", "Don't worry if this is confusing. It takes repetition and practice to get it to stick!" ] }, { "cell_type": "markdown", "id": "6ef352ef-4faa-4eb2-8a0a-e238affe6c4f", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "### Part 1 - The `pingouin` package for basic statistics\n", "You will be familiar with basic statistics in psychology at this point, and it is very useful to know how to do them in Python. The `pingouin` package covers most of these, and it works seamlessly with `pandas`. As before, we need to import it, so we will import the packages we need here." 
] }, { "cell_type": "code", "execution_count": 1, "id": "0cb0f8ca-83c4-4e98-9c52-fdd8d604ba5e", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "outputs": [], "source": [ "# Import what we need\n", "import pandas as pd # dataframes\n", "import seaborn as sns # plots\n", "import pingouin as pg # stats, note the traditional alias\n", "\n", "# Set the style for plots\n", "sns.set_style('dark') # a different theme!\n", "sns.set_context('talk')" ] }, { "cell_type": "markdown", "id": "cb640d9f-eaa2-4fc8-814a-5752b41c20c2", "metadata": { "editable": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "source": [ "Let's see how this package allows us to do some basic psychology-style analyses, with some example datasets that we can load from `seaborn`." ] }, { "cell_type": "markdown", "id": "5a609664-abf1-4cd9-adee-8f277ea067e7", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "#### The **independent** samples t-test\n", "This t-test is often the first analysis we learn. With `pingouin`, it is handled by the `pg.ttest` function. Let's load up the `tips` dataset and do a t-test.\n", "\n", "**Test**: Do smokers tip more than non-smokers? To test this we need to give the function just the values for each group, which we get by filtering our data." ] }, { "cell_type": "code", "execution_count": 2, "id": "68d19d7c-3275-4f0f-b302-5216264343ea", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table>\n", "<thead>\n", "<tr><th></th><th>total_bill</th><th>tip</th><th>sex</th><th>smoker</th><th>day</th><th>time</th><th>size</th></tr>\n", "</thead>\n", "<tbody>\n", "<tr><th>0</th><td>16.99</td><td>1.01</td><td>Female</td><td>No</td><td>Sun</td><td>Dinner</td><td>2</td></tr>\n", "<tr><th>1</th><td>10.34</td><td>1.66</td><td>Male</td><td>No</td><td>Sun</td><td>Dinner</td><td>3</td></tr>\n", "</tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " total_bill tip sex smoker day time size\n", "0 16.99 1.01 Female No Sun Dinner 2\n", "1 10.34 1.66 Male No Sun Dinner 3" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# First load the data\n", "tips = sns.load_dataset('tips')\n", "display(tips.head(2))" ] }, { "cell_type": "code", "execution_count": 3, "id": "3e4e7b04-d4c0-45af-bb0d-97f3ea38a5a1", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table>\n", "<thead>\n", "<tr><th></th><th>T</th><th>dof</th><th>alternative</th><th>p-val</th><th>CI95%</th><th>cohen-d</th><th>BF10</th><th>power</th></tr>\n", "</thead>\n", "<tbody>\n", "<tr><th>T-test</th><td>0.0918</td><td>192.2634</td><td>two-sided</td><td>0.9269</td><td>[-0.35, 0.38]</td><td>0.0122</td><td>0.145</td><td>0.051</td></tr>\n", "</tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " T dof alternative p-val CI95% cohen-d BF10 \\\n", "T-test 0.0918 192.2634 two-sided 0.9269 [-0.35, 0.38] 0.0122 0.145 \n", "\n", " power \n", "T-test 0.051 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# First filter the data by smokers and nonsmokers, and select the tip column\n", "smokers = tips.query('smoker == \"Yes\"')['tip']\n", "nonsmokers = tips.query('smoker == \"No\"')['tip']\n", "\n", "# Pass to the t-test function\n", "pg.ttest(smokers, nonsmokers).round(4)" ] }, { "cell_type": "markdown", "id": "19c89a5f-e5b8-4c0f-82bd-764a3df6cf23", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "#### The **paired** samples t-test \n", "Comparing the means of two paired variables uses the same function, with `paired=True` telling it that the two variables are paired. Let's compare the mean bill with the mean tip - meals should, on average, cost more than their tips!" ] }, { "cell_type": "code", "execution_count": 4, "id": "54d2e488-0eb5-4d64-baa6-ed0671eac15a", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table>\n", "<thead>\n", "<tr><th></th><th>T</th><th>dof</th><th>alternative</th><th>p-val</th><th>CI95%</th><th>cohen-d</th><th>BF10</th><th>power</th></tr>\n", "</thead>\n", "<tbody>\n", "<tr><th>T-test</th><td>32.6465</td><td>243</td><td>two-sided</td><td>0.0</td><td>[15.77, 17.8]</td><td>2.6352</td><td>1.222e+87</td><td>1.0</td></tr>\n", "</tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " T dof alternative p-val CI95% cohen-d BF10 \\\n", "T-test 32.6465 243 two-sided 0.0 [15.77, 17.8] 2.6352 1.222e+87 \n", "\n", " power \n", "T-test 1.0 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Select the columns\n", "bills = tips['total_bill']\n", "tipamount = tips['tip']\n", "\n", "# Do the test\n", "pg.ttest(bills, tipamount, paired=True).round(4)" ] }, { "cell_type": "markdown", "id": "cc893df3-f81d-46ef-9b37-02a8fd994f2b", "metadata": { "editable": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "source": [ "One-sample tests are also easily achieved by stating the target value as the second input. For example, to compare the mean tip amount to $1, we do `pg.ttest(tipamount, 1)`." ] }, { "cell_type": "markdown", "id": "ccd03bd9-0aeb-4267-8e4c-9115f46947bc", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "#### Correlation\n", "This useful statistic is a special case of the general linear model as we will see, but it is computed easily enough with `pg.corr`.\n", "\n", "Let us correlate the tip amount and total bill amount:" ] }, { "cell_type": "code", "execution_count": 5, "id": "cd2328d7-842d-41a0-99cf-e32aabedb4dd", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table>\n", "<thead>\n", "<tr><th></th><th>n</th><th>r</th><th>CI95%</th><th>p-val</th><th>BF10</th><th>power</th></tr>\n", "</thead>\n", "<tbody>\n", "<tr><th>pearson</th><td>244</td><td>0.675734</td><td>[0.6, 0.74]</td><td>6.692471e-34</td><td>4.952e+30</td><td>1.0</td></tr>\n", "</tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " n r CI95% p-val BF10 power\n", "pearson 244 0.675734 [0.6, 0.74] 6.692471e-34 4.952e+30 1.0" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Correlation\n", "pg.corr(tips['total_bill'], tips['tip'])" ] }, { "cell_type": "markdown", "id": "42e2df7f-0249-4018-9902-1323c9bf1035", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "#### ANOVA\n", "The analysis of variance (which is really about differences in means) comes in many forms - one-way, two-way, mixed, repeated measures, analysis of covariance, and so on. We need not dwell on these confusing definitions, but it is good to demonstrate how they are handled." ] }, { "cell_type": "markdown", "id": "fb990371-444f-4a65-b780-82e5ae923c5b", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "#### One-way ANOVA\n", "The one-way ANOVA tests for mean differences in a continuous outcome across two or more groups, and is handled by the `pg.anova` function. Let's examine whether tip amounts vary across days - the `tips` dataset covers Thursday to Sunday. A plot will help us." ] }, { "cell_type": "code", "execution_count": 6, "id": "174a7301-ab45-4925-bb81-196593993c59", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Plot tips\n", "sns.barplot(data=tips, x='day', y='tip');" ] }, { "cell_type": "code", "execution_count": 7, "id": "c762ea9c-27ec-4d87-9646-413fa37b8cc1", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table>\n", "<thead>\n", "<tr><th></th><th>Source</th><th>ddof1</th><th>ddof2</th><th>F</th><th>p-unc</th><th>np2</th></tr>\n", "</thead>\n", "<tbody>\n", "<tr><th>0</th><td>day</td><td>3</td><td>240</td><td>1.672355</td><td>0.173589</td><td>0.020476</td></tr>\n", "</tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Source ddof1 ddof2 F p-unc np2\n", "0 day 3 240 1.672355 0.173589 0.020476" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# A one-way ANOVA tests for differences amongst those means\n", "# Notice we pass tips to the 'data' argument.\n", "# Notice 'between' names the between-subjects factor(s);\n", "# we can add more variables to that list\n", "pg.anova(data=tips, dv='tip', between=['day'])" ] }, { "cell_type": "markdown", "id": "1bc38d71-8273-40d4-aace-94dfe6f87eb8", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "Extending this to a 'two-way' ANOVA is easy. Do tips differ across days, and between men and women?" ] }, { "cell_type": "code", "execution_count": 8, "id": "f9d8d188-4d80-4d73-8c06-e0c0ccbed936", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Plot tips\n", "sns.barplot(data=tips, x='day', y='tip', hue='sex');" ] }, { "cell_type": "code", "execution_count": 9, "id": "fc9f5f25-8075-4958-8cff-58944cde2209", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table>\n", "<thead>\n", "<tr><th></th><th>Source</th><th>SS</th><th>DF</th><th>MS</th><th>F</th><th>p-unc</th><th>np2</th></tr>\n", "</thead>\n", "<tbody>\n", "<tr><th>0</th><td>day</td><td>7.446900</td><td>3.0</td><td>2.482300</td><td>1.298061</td><td>0.275785</td><td>0.016233</td></tr>\n", "<tr><th>1</th><td>sex</td><td>1.594561</td><td>1.0</td><td>1.594561</td><td>0.833839</td><td>0.362097</td><td>0.003521</td></tr>\n", "<tr><th>2</th><td>day * sex</td><td>2.785891</td><td>3.0</td><td>0.928630</td><td>0.485606</td><td>0.692600</td><td>0.006135</td></tr>\n", "<tr><th>3</th><td>Residual</td><td>451.306151</td><td>236.0</td><td>1.912314</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n", "</tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Source SS DF MS F p-unc np2\n", "0 day 7.446900 3.0 2.482300 1.298061 0.275785 0.016233\n", "1 sex 1.594561 1.0 1.594561 0.833839 0.362097 0.003521\n", "2 day * sex 2.785891 3.0 0.928630 0.485606 0.692600 0.006135\n", "3 Residual 451.306151 236.0 1.912314 NaN NaN NaN" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Extend ANOVA\n", "pg.anova(data=tips, between=['day', 'sex'], dv='tip')" ] }, { "cell_type": "markdown", "id": "261e294d-79fa-4c0a-9515-c12c275f4819", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "Further cases can be handled, such as:\n", "- Analysis of Covariance, ANCOVA - 'controlling' for a variable before the ANOVA is run, `pg.ancova`.\n", "- Repeated measures ANOVA - analysing fully repeated measures, all within-participants, `pg.rm_anova`.\n", "- Mixed ANOVA - analysing data with one variable measured between, and another within, `pg.mixed_anova`.\n", "\n", "The latter two expect your data in **long-form**.\n", "`pingouin` has loads of other options you can try too, for many things outside of ANOVA." ] }, { "cell_type": "markdown", "id": "47e4e28e-0b24-49d1-b660-ed1964f12324", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "### Part 2 - `statsmodels` for general linear models\n", "Now for the main course. This approach will serve us for most of the rest of the module, and serve you well for the future!\n", "\n", "`statsmodels` is a package for building statistical models, and gives us full flexibility in building our models. It works in conjunction with `pandas` dataframes, integrating them with a *formula string* interface that lets us specify our model structure in a simple way. \n", "\n", "First, let's import it." 
] }, { "cell_type": "code", "execution_count": 10, "id": "3ab8786c-d87b-4775-a115-5a4c56da3c93", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "outputs": [], "source": [ "# Import statsmodels formula interface\n", "import statsmodels.formula.api as smf \n", "# looks a little different due to structure of package" ] }, { "cell_type": "markdown", "id": "677c6345-26d6-4796-96ce-8563a90ecd2d", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "Let's build a model that predicts tip amount from the bill and number of diners. To do this, we use the `ols` function from `statsmodels`, pass it the string that defines the model along with the data, and then call `.fit()`. A lot of steps, but simple ones:\n" ] }, { "cell_type": "code", "execution_count": 11, "id": "ab4ec5c5-d1a4-415b-9e04-0760ab96efbc", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "outputs": [], "source": [ "# Build and fit our model\n", "first_model = smf.ols('tip ~ 1 + total_bill + size', data=tips).fit()" ] }, { "cell_type": "markdown", "id": "a63a6b78-60b2-491c-8561-8168e536c230", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "Things to note:\n", "- The names of the included variables are the column names of the dataframe.\n", "- The DV is on the left and separated with `~`, which can be read as 'is predicted/modelled by'.\n", "- The `1` means 'fit an intercept'. We almost always want this, and the formula interface includes it by default even if we leave it out.\n", "- We include predictors by literally adding them to the model with a plus.\n", "\n", "Once we have fitted this model, we can ask it to provide us with a `.summary()`, which gives the readout of information in the regression." 
] }, { "cell_type": "code", "execution_count": 12, "id": "10796c9d-a9c0-435f-8864-ef9450fbf454", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
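<table>\n", "<caption>OLS Regression Results</caption>\n", "<tr><th>Dep. Variable:</th><td>tip</td><th>R-squared:</th><td>0.468</td></tr>\n", "<tr><th>Model:</th><td>OLS</td><th>Adj. R-squared:</th><td>0.463</td></tr>\n", "<tr><th>No. Observations:</th><td>244</td><th>F-statistic:</th><td>105.9</td></tr>\n", "<tr><th>Covariance Type:</th><td>nonrobust</td><th>Prob (F-statistic):</th><td>9.67e-34</td></tr>\n", "</table>\n", "<table>\n", "<tr><td></td><th>coef</th><th>std err</th><th>t</th><th>P&gt;|t|</th><th>[0.025</th><th>0.975]</th></tr>\n", "<tr><th>Intercept</th><td>0.6689</td><td>0.194</td><td>3.455</td><td>0.001</td><td>0.288</td><td>1.050</td></tr>\n", "<tr><th>total_bill</th><td>0.0927</td><td>0.009</td><td>10.172</td><td>0.000</td><td>0.075</td><td>0.111</td></tr>\n", "<tr><th>size</th><td>0.1926</td><td>0.085</td><td>2.258</td><td>0.025</td><td>0.025</td><td>0.361</td></tr>\n", "</table><br/><br/>Notes:<br/>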
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified." ], "text/latex": [ "\\begin{center}\n", "\\begin{tabular}{lclc}\n", "\\toprule\n", "\\textbf{Dep. Variable:} & tip & \\textbf{ R-squared: } & 0.468 \\\\\n", "\\textbf{Model:} & OLS & \\textbf{ Adj. R-squared: } & 0.463 \\\\\n", "\\textbf{No. Observations:} & 244 & \\textbf{ F-statistic: } & 105.9 \\\\\n", "\\textbf{Covariance Type:} & nonrobust & \\textbf{ Prob (F-statistic):} & 9.67e-34 \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "\\begin{tabular}{lcccccc}\n", " & \\textbf{coef} & \\textbf{std err} & \\textbf{t} & \\textbf{P$> |$t$|$} & \\textbf{[0.025} & \\textbf{0.975]} \\\\\n", "\\midrule\n", "\\textbf{Intercept} & 0.6689 & 0.194 & 3.455 & 0.001 & 0.288 & 1.050 \\\\\n", "\\textbf{total\\_bill} & 0.0927 & 0.009 & 10.172 & 0.000 & 0.075 & 0.111 \\\\\n", "\\textbf{size} & 0.1926 & 0.085 & 2.258 & 0.025 & 0.025 & 0.361 \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "%\\caption{OLS Regression Results}\n", "\\end{center}\n", "\n", "Notes: \\newline\n", " [1] Standard Errors assume that the covariance matrix of the errors is correctly specified." ], "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: tip R-squared: 0.468\n", "Model: OLS Adj. R-squared: 0.463\n", "No. 
Observations: 244 F-statistic: 105.9\n", "Covariance Type: nonrobust Prob (F-statistic): 9.67e-34\n", "==============================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "------------------------------------------------------------------------------\n", "Intercept 0.6689 0.194 3.455 0.001 0.288 1.050\n", "total_bill 0.0927 0.009 10.172 0.000 0.075 0.111\n", "size 0.1926 0.085 2.258 0.025 0.025 0.361\n", "==============================================================================\n", "\n", "Notes:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "\"\"\"" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Show the summary of the fitted model\n", "first_model.summary(slim=True)\n", "# slim=False has more, but less relevant, info" ] }, { "cell_type": "markdown", "id": "b114ceba-d5eb-46b1-ab57-6a587a5ba48c", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "The `formula` interface is very flexible and allows you to alter the variables in the dataframe by modifying the formula. For example, if we want to z-score our predictors, we can use the `scale` function directly in the formula, and it will adjust the variables for us:" ] }, { "cell_type": "code", "execution_count": 13, "id": "d6f6956d-4c2a-4ae2-ad50-09db6211956f", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
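<table>\n", "<caption>OLS Regression Results</caption>\n", "<tr><th>Dep. Variable:</th><td>tip</td><th>R-squared:</th><td>0.468</td></tr>\n", "<tr><th>Model:</th><td>OLS</td><th>Adj. R-squared:</th><td>0.463</td></tr>\n", "<tr><th>No. Observations:</th><td>244</td><th>F-statistic:</th><td>105.9</td></tr>\n", "<tr><th>Covariance Type:</th><td>nonrobust</td><th>Prob (F-statistic):</th><td>9.67e-34</td></tr>\n", "</table>\n", "<table>\n", "<tr><td></td><th>coef</th><th>std err</th><th>t</th><th>P&gt;|t|</th><th>[0.025</th><th>0.975]</th></tr>\n", "<tr><th>Intercept</th><td>2.9983</td><td>0.065</td><td>46.210</td><td>0.000</td><td>2.870</td><td>3.126</td></tr>\n", "<tr><th>scale(total_bill)</th><td>0.8237</td><td>0.081</td><td>10.172</td><td>0.000</td><td>0.664</td><td>0.983</td></tr>\n", "<tr><th>scale(size)</th><td>0.1828</td><td>0.081</td><td>2.258</td><td>0.025</td><td>0.023</td><td>0.342</td></tr>\n", "</table><br/><br/>Notes:<br/>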
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified." ], "text/latex": [ "\\begin{center}\n", "\\begin{tabular}{lclc}\n", "\\toprule\n", "\\textbf{Dep. Variable:} & tip & \\textbf{ R-squared: } & 0.468 \\\\\n", "\\textbf{Model:} & OLS & \\textbf{ Adj. R-squared: } & 0.463 \\\\\n", "\\textbf{No. Observations:} & 244 & \\textbf{ F-statistic: } & 105.9 \\\\\n", "\\textbf{Covariance Type:} & nonrobust & \\textbf{ Prob (F-statistic):} & 9.67e-34 \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "\\begin{tabular}{lcccccc}\n", " & \\textbf{coef} & \\textbf{std err} & \\textbf{t} & \\textbf{P$> |$t$|$} & \\textbf{[0.025} & \\textbf{0.975]} \\\\\n", "\\midrule\n", "\\textbf{Intercept} & 2.9983 & 0.065 & 46.210 & 0.000 & 2.870 & 3.126 \\\\\n", "\\textbf{scale(total\\_bill)} & 0.8237 & 0.081 & 10.172 & 0.000 & 0.664 & 0.983 \\\\\n", "\\textbf{scale(size)} & 0.1828 & 0.081 & 2.258 & 0.025 & 0.023 & 0.342 \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "%\\caption{OLS Regression Results}\n", "\\end{center}\n", "\n", "Notes: \\newline\n", " [1] Standard Errors assume that the covariance matrix of the errors is correctly specified." ], "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: tip R-squared: 0.468\n", "Model: OLS Adj. R-squared: 0.463\n", "No. 
Observations: 244 F-statistic: 105.9\n", "Covariance Type: nonrobust Prob (F-statistic): 9.67e-34\n", "=====================================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "-------------------------------------------------------------------------------------\n", "Intercept 2.9983 0.065 46.210 0.000 2.870 3.126\n", "scale(total_bill) 0.8237 0.081 10.172 0.000 0.664 0.983\n", "scale(size) 0.1828 0.081 2.258 0.025 0.023 0.342\n", "=====================================================================================\n", "\n", "Notes:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "\"\"\"" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Scale predictors\n", "scale_predictors = smf.ols('tip ~ 1 + scale(total_bill) + scale(size)', \n", " data=tips).fit()\n", "scale_predictors.summary(slim=True)" ] }, { "cell_type": "markdown", "id": "48ac5416-3206-41b9-84fa-393af7ca6148", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "Scaling the DV as well is also simple - this gives the 'standardised coefficients' you see in some statistics software:" ] }, { "cell_type": "code", "execution_count": 14, "id": "f8e1eca1-e94e-46f4-adae-a770f73d2e05", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
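<table>\n", "<caption>OLS Regression Results</caption>\n", "<tr><th>Dep. Variable:</th><td>scale(tip)</td><th>R-squared:</th><td>0.468</td></tr>\n", "<tr><th>Model:</th><td>OLS</td><th>Adj. R-squared:</th><td>0.463</td></tr>\n", "<tr><th>No. Observations:</th><td>244</td><th>F-statistic:</th><td>105.9</td></tr>\n", "<tr><th>Covariance Type:</th><td>nonrobust</td><th>Prob (F-statistic):</th><td>9.67e-34</td></tr>\n", "</table>\n", "<table>\n", "<tr><td></td><th>coef</th><th>std err</th><th>t</th><th>P&gt;|t|</th><th>[0.025</th><th>0.975]</th></tr>\n", "<tr><th>Intercept</th><td>-3.851e-16</td><td>0.047</td><td>-8.2e-15</td><td>1.000</td><td>-0.093</td><td>0.093</td></tr>\n", "<tr><th>scale(total_bill)</th><td>0.5965</td><td>0.059</td><td>10.172</td><td>0.000</td><td>0.481</td><td>0.712</td></tr>\n", "<tr><th>scale(size)</th><td>0.1324</td><td>0.059</td><td>2.258</td><td>0.025</td><td>0.017</td><td>0.248</td></tr>\n", "</table><br/><br/>Notes:<br/>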
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified." ], "text/latex": [ "\\begin{center}\n", "\\begin{tabular}{lclc}\n", "\\toprule\n", "\\textbf{Dep. Variable:} & scale(tip) & \\textbf{ R-squared: } & 0.468 \\\\\n", "\\textbf{Model:} & OLS & \\textbf{ Adj. R-squared: } & 0.463 \\\\\n", "\\textbf{No. Observations:} & 244 & \\textbf{ F-statistic: } & 105.9 \\\\\n", "\\textbf{Covariance Type:} & nonrobust & \\textbf{ Prob (F-statistic):} & 9.67e-34 \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "\\begin{tabular}{lcccccc}\n", " & \\textbf{coef} & \\textbf{std err} & \\textbf{t} & \\textbf{P$> |$t$|$} & \\textbf{[0.025} & \\textbf{0.975]} \\\\\n", "\\midrule\n", "\\textbf{Intercept} & -3.851e-16 & 0.047 & -8.2e-15 & 1.000 & -0.093 & 0.093 \\\\\n", "\\textbf{scale(total\\_bill)} & 0.5965 & 0.059 & 10.172 & 0.000 & 0.481 & 0.712 \\\\\n", "\\textbf{scale(size)} & 0.1324 & 0.059 & 2.258 & 0.025 & 0.017 & 0.248 \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "%\\caption{OLS Regression Results}\n", "\\end{center}\n", "\n", "Notes: \\newline\n", " [1] Standard Errors assume that the covariance matrix of the errors is correctly specified." ], "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: scale(tip) R-squared: 0.468\n", "Model: OLS Adj. R-squared: 0.463\n", "No. 
Observations: 244 F-statistic: 105.9\n", "Covariance Type: nonrobust Prob (F-statistic): 9.67e-34\n", "=====================================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "-------------------------------------------------------------------------------------\n", "Intercept -3.851e-16 0.047 -8.2e-15 1.000 -0.093 0.093\n", "scale(total_bill) 0.5965 0.059 10.172 0.000 0.481 0.712\n", "scale(size) 0.1324 0.059 2.258 0.025 0.017 0.248\n", "=====================================================================================\n", "\n", "Notes:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "\"\"\"" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Scale the predictors and the DV\n", "scale_all = smf.ols('scale(tip) ~ 1 + scale(total_bill) + scale(size)',\n", " data=tips).fit()\n", "scale_all.summary(slim=True)" ] }, { "cell_type": "markdown", "id": "ca6c64fd-2357-48f8-86c6-247ae7655bf4", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "#### Extra scores and attributes\n", "While the `.summary()` printout has a lot of useful features, there are a few other things to know about. \n", "- The `.fittedvalues` attribute contains the model's predictions.\n", "- The `.resid` attribute contains the residuals or errors.\n", "- `.summary()` doesn't show us the RMSE, or how wrong our model is on average. We can import a function from `statsmodels` to do that for us, which we do next.\n", "\n", "We need to import the `statsmodels.tools.eval_measures` module and use its `rmse` function. It takes the predictions and the actual data, both of which are easy to access!" 
] }, { "cell_type": "code", "execution_count": 15, "id": "2c8ec745-f13b-4c25-a29c-e11ac28340aa", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "1.007256127114662" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Import it under the alias 'measures'\n", "import statsmodels.tools.eval_measures as measures\n", "measures.rmse(first_model.fittedvalues, tips['tip']) \n", "# notice we put in the predictions, and then the actual values" ] }, { "cell_type": "markdown", "id": "62d995e0-89f3-44af-8d72-19e30edaeb5d", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "### Part 3 - Wait, it's just the general linear model?\n", "Now that we've seen how easy it is to fit models, we can demonstrate how some standard statistics are instances of the general linear model. First, a correlation.\n", "\n", "A correlation is a general linear model, when you have **one predictor**, and you **standardise both the predictor and the dependent variable**.\n", "\n", "Let's re-correlate tips and total bill:" ] }, { "cell_type": "code", "execution_count": 16, "id": "9ff31dc0-034e-44fc-aae9-8e821f581431", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table>\n", "<thead>\n", "<tr><th></th><th>n</th><th>r</th><th>CI95%</th><th>p-val</th><th>BF10</th><th>power</th></tr>\n", "</thead>\n", "<tbody>\n", "<tr><th>pearson</th><td>244</td><td>0.675734</td><td>[0.6, 0.74]</td><td>6.692471e-34</td><td>4.952e+30</td><td>1.0</td></tr>\n", "</tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " n r CI95% p-val BF10 power\n", "pearson 244 0.675734 [0.6, 0.74] 6.692471e-34 4.952e+30 1.0" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Correlate\n", "pg.corr(tips['tip'], tips['total_bill'])" ] }, { "cell_type": "markdown", "id": "80f8a602-0095-4ff8-a39e-63ceafe488ad", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "Now let's fit a linear model, scaling both the DV and the predictor (it does not matter which variable plays which role):" ] }, { "cell_type": "code", "execution_count": 17, "id": "7bb7d7a0-e0dc-4a1c-9cf0-cff202868e5c", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
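<table>\n", "<caption>OLS Regression Results</caption>\n", "<tr><th>Dep. Variable:</th><td>scale(total_bill)</td><th>R-squared:</th><td>0.457</td></tr>\n", "<tr><th>Model:</th><td>OLS</td><th>Adj. R-squared:</th><td>0.454</td></tr>\n", "<tr><th>No. Observations:</th><td>244</td><th>F-statistic:</th><td>203.4</td></tr>\n", "<tr><th>Covariance Type:</th><td>nonrobust</td><th>Prob (F-statistic):</th><td>6.69e-34</td></tr>\n", "</table>\n", "<table>\n", "<tr><td></td><th>coef</th><th>std err</th><th>t</th><th>P&gt;|t|</th><th>[0.025</th><th>0.975]</th></tr>\n", "<tr><th>Intercept</th><td>7.286e-16</td><td>0.047</td><td>1.54e-14</td><td>1.000</td><td>-0.093</td><td>0.093</td></tr>\n", "<tr><th>scale(tip)</th><td>0.6757</td><td>0.047</td><td>14.260</td><td>0.000</td><td>0.582</td><td>0.769</td></tr>\n", "</table><br/><br/>Notes:<br/>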
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified." ], "text/latex": [ "\\begin{center}\n", "\\begin{tabular}{lclc}\n", "\\toprule\n", "\\textbf{Dep. Variable:} & scale(total\\_bill) & \\textbf{ R-squared: } & 0.457 \\\\\n", "\\textbf{Model:} & OLS & \\textbf{ Adj. R-squared: } & 0.454 \\\\\n", "\\textbf{No. Observations:} & 244 & \\textbf{ F-statistic: } & 203.4 \\\\\n", "\\textbf{Covariance Type:} & nonrobust & \\textbf{ Prob (F-statistic):} & 6.69e-34 \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "\\begin{tabular}{lcccccc}\n", " & \\textbf{coef} & \\textbf{std err} & \\textbf{t} & \\textbf{P$> |$t$|$} & \\textbf{[0.025} & \\textbf{0.975]} \\\\\n", "\\midrule\n", "\\textbf{Intercept} & 7.286e-16 & 0.047 & 1.54e-14 & 1.000 & -0.093 & 0.093 \\\\\n", "\\textbf{scale(tip)} & 0.6757 & 0.047 & 14.260 & 0.000 & 0.582 & 0.769 \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "%\\caption{OLS Regression Results}\n", "\\end{center}\n", "\n", "Notes: \\newline\n", " [1] Standard Errors assume that the covariance matrix of the errors is correctly specified." ], "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: scale(total_bill) R-squared: 0.457\n", "Model: OLS Adj. R-squared: 0.454\n", "No. 
Observations: 244 F-statistic: 203.4\n", "Covariance Type: nonrobust Prob (F-statistic): 6.69e-34\n", "==============================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "------------------------------------------------------------------------------\n", "Intercept 7.286e-16 0.047 1.54e-14 1.000 -0.093 0.093\n", "scale(tip) 0.6757 0.047 14.260 0.000 0.582 0.769\n", "==============================================================================\n", "\n", "Notes:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "\"\"\"" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Linear model\n", "smf.ols('scale(total_bill) ~ scale(tip)', \n", " data=tips).fit().summary(slim=True)" ] }, { "cell_type": "markdown", "id": "34679c6e-408c-49a2-92a7-70cb1c46088e", "metadata": { "editable": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "source": [ "The intercept is as close to zero as our computer can handle, and the coefficient equals the correlation!" ] }, { "cell_type": "markdown", "id": "27a1439c-919d-4ed9-b95d-5cc51d4fe079", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "A t-test is also simple, comparing tips between females and males:" ] }, { "cell_type": "code", "execution_count": 18, "id": "787cae76-18d5-43a4-a5c6-645d98ff0026", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table>\n", "<thead>\n", "<tr><th></th><th>T</th><th>dof</th><th>alternative</th><th>p-val</th><th>CI95%</th><th>cohen-d</th><th>BF10</th><th>power</th></tr>\n", "</thead>\n", "<tbody>\n", "<tr><th>T-test</th><td>-1.38786</td><td>242</td><td>two-sided</td><td>0.166456</td><td>[-0.62, 0.11]</td><td>0.185494</td><td>0.361</td><td>0.282179</td></tr>\n", "</tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " T dof alternative p-val CI95% cohen-d BF10 \\\n", "T-test -1.38786 242 two-sided 0.166456 [-0.62, 0.11] 0.185494 0.361 \n", "\n", " power \n", "T-test 0.282179 " ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Conduct t-test\n", "pg.ttest(tips.query('sex == \"Female\"')['tip'], \n", " tips.query('sex == \"Male\"')['tip'],\n", " correction=False) # no Welch correction, so the result matches the linear model" ] }, { "cell_type": "code", "execution_count": 19, "id": "c3345dda-e419-4e02-a5b7-9b185a96e3da", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
OLS Regression Results
Dep. Variable: tip R-squared: 0.008
Model: OLS Adj. R-squared: 0.004
No. Observations: 244 F-statistic: 1.926
Covariance Type: nonrobust Prob (F-statistic): 0.166
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
coef std err t P>|t| [0.025 0.975]
Intercept 3.0896 0.110 28.032 0.000 2.873 3.307
sex[T.Female] -0.2562 0.185 -1.388 0.166 -0.620 0.107


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified." ], "text/latex": [ "\\begin{center}\n", "\\begin{tabular}{lclc}\n", "\\toprule\n", "\\textbf{Dep. Variable:} & tip & \\textbf{ R-squared: } & 0.008 \\\\\n", "\\textbf{Model:} & OLS & \\textbf{ Adj. R-squared: } & 0.004 \\\\\n", "\\textbf{No. Observations:} & 244 & \\textbf{ F-statistic: } & 1.926 \\\\\n", "\\textbf{Covariance Type:} & nonrobust & \\textbf{ Prob (F-statistic):} & 0.166 \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "\\begin{tabular}{lcccccc}\n", " & \\textbf{coef} & \\textbf{std err} & \\textbf{t} & \\textbf{P$> |$t$|$} & \\textbf{[0.025} & \\textbf{0.975]} \\\\\n", "\\midrule\n", "\\textbf{Intercept} & 3.0896 & 0.110 & 28.032 & 0.000 & 2.873 & 3.307 \\\\\n", "\\textbf{sex[T.Female]} & -0.2562 & 0.185 & -1.388 & 0.166 & -0.620 & 0.107 \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "%\\caption{OLS Regression Results}\n", "\\end{center}\n", "\n", "Notes: \\newline\n", " [1] Standard Errors assume that the covariance matrix of the errors is correctly specified." ], "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: tip R-squared: 0.008\n", "Model: OLS Adj. R-squared: 0.004\n", "No. 
Observations: 244 F-statistic: 1.926\n", "Covariance Type: nonrobust Prob (F-statistic): 0.166\n", "=================================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "---------------------------------------------------------------------------------\n", "Intercept 3.0896 0.110 28.032 0.000 2.873 3.307\n", "sex[T.Female] -0.2562 0.185 -1.388 0.166 -0.620 0.107\n", "=================================================================================\n", "\n", "Notes:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "\"\"\"" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# As a linear model\n", "ttest_model = smf.ols('tip ~ sex', data=tips).fit()\n", "ttest_model.summary(slim=True)" ] }, { "cell_type": "markdown", "id": "395eda4b-a49b-43fc-8380-de09a07b4a7d", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "And we can confirm the mean difference is as the model suggests:" ] }, { "cell_type": "code", "execution_count": 20, "id": "748d6122-ecea-44d1-ba62-974fa88fee67", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tip
sex
Male3.089618
Female2.833448
\n", "
" ], "text/plain": [ " tip\n", "sex \n", "Male 3.089618\n", "Female 2.833448" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "tip -0.25617\n", "dtype: float64" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Confirm mean difference\n", "means = tips.groupby(by=['sex'], observed=False).agg({'tip': 'mean'})\n", "display(means)\n", "display(means.loc['Female'] - means.loc['Male'])" ] }, { "cell_type": "markdown", "id": "961fd37b-a479-472e-98da-1af9c6bb1ccd", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "For a final graphical illustration, let us recode the `sex` variable of tips into a column that has:\n", "- Males are 0\n", "- Females are 1\n", "\n", "And plot that against the `fittedvalues`, and overlay the raw data." ] }, { "cell_type": "code", "execution_count": 21, "id": "5dc91ef0-a2c5-4df0-b191-363bb7c70293", "metadata": { "editable": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "outputs": [], "source": [ "# Generate values\n", "female_one_male_zero = tips['sex'].case_when(\n", " [\n", " (tips['sex'] == \"Female\", 1),\n", " (tips['sex'] == \"Male\", 0)\n", " ]\n", ")" ] }, { "cell_type": "code", "execution_count": 22, "id": "c9793947-e3ee-47a4-b61c-775937702ef2", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Raw data\n", "axis = sns.stripplot(data=tips, y='tip', x='sex', alpha=.3)\n", "# Adds the means of each sex\n", "sns.pointplot(data=tips, y='tip', x='sex', \n", " linestyle='none', color='black', ax=axis, zorder=3)\n", "# Add the predictions\n", "sns.lineplot(x=female_one_male_zero,\n", " y=ttest_model.fittedvalues,\n", " color='black');" ] }, { "cell_type": "markdown", "id": "1976623f-3469-465c-9f93-cb57cc18e2c8", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "Finally, we will consider how a one-way ANOVA works, and demonstrate how a linear model 'sees' an ANOVA.\n", "\n", "Lets compare the means of tips given on the different days (Thursday, Friday, Saturday, Sunday). The traditional approach here is the one-way ANOVA, done like so:" ] }, { "cell_type": "code", "execution_count": 23, "id": "5c137180-2512-4401-b7d6-640d58f5aa60", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sourceddof1ddof2Fp-uncn2
0day32401.6723550.1735890.020476
\n", "
" ], "text/plain": [ " Source ddof1 ddof2 F p-unc n2\n", "0 day 3 240 1.672355 0.173589 0.020476" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# One way ANOVA\n", "pg.anova(data=tips, between=['day'], dv='tip', effsize='n2')" ] }, { "cell_type": "markdown", "id": "72f7f357-6830-473b-8125-92550eb03fa0", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "Building this model as a GLM is conceptually very simple - we predict tips, from day. Lets fit this model and explore some consequences." ] }, { "cell_type": "code", "execution_count": 24, "id": "f29a1b13-ab27-4a12-a437-e2df1ab8a7a5", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
OLS Regression Results
Dep. Variable: tip R-squared: 0.020
Model: OLS Adj. R-squared: 0.008
No. Observations: 244 F-statistic: 1.672
Covariance Type: nonrobust Prob (F-statistic): 0.174
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
coef std err t P>|t| [0.025 0.975]
Intercept 2.7715 0.175 15.837 0.000 2.427 3.116
day[T.Fri] -0.0367 0.361 -0.102 0.919 -0.748 0.675
day[T.Sat] 0.2217 0.229 0.968 0.334 -0.229 0.673
day[T.Sun] 0.4837 0.236 2.051 0.041 0.019 0.948


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified." ], "text/latex": [ "\\begin{center}\n", "\\begin{tabular}{lclc}\n", "\\toprule\n", "\\textbf{Dep. Variable:} & tip & \\textbf{ R-squared: } & 0.020 \\\\\n", "\\textbf{Model:} & OLS & \\textbf{ Adj. R-squared: } & 0.008 \\\\\n", "\\textbf{No. Observations:} & 244 & \\textbf{ F-statistic: } & 1.672 \\\\\n", "\\textbf{Covariance Type:} & nonrobust & \\textbf{ Prob (F-statistic):} & 0.174 \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "\\begin{tabular}{lcccccc}\n", " & \\textbf{coef} & \\textbf{std err} & \\textbf{t} & \\textbf{P$> |$t$|$} & \\textbf{[0.025} & \\textbf{0.975]} \\\\\n", "\\midrule\n", "\\textbf{Intercept} & 2.7715 & 0.175 & 15.837 & 0.000 & 2.427 & 3.116 \\\\\n", "\\textbf{day[T.Fri]} & -0.0367 & 0.361 & -0.102 & 0.919 & -0.748 & 0.675 \\\\\n", "\\textbf{day[T.Sat]} & 0.2217 & 0.229 & 0.968 & 0.334 & -0.229 & 0.673 \\\\\n", "\\textbf{day[T.Sun]} & 0.4837 & 0.236 & 2.051 & 0.041 & 0.019 & 0.948 \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "%\\caption{OLS Regression Results}\n", "\\end{center}\n", "\n", "Notes: \\newline\n", " [1] Standard Errors assume that the covariance matrix of the errors is correctly specified." ], "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: tip R-squared: 0.020\n", "Model: OLS Adj. R-squared: 0.008\n", "No. 
Observations: 244 F-statistic: 1.672\n", "Covariance Type: nonrobust Prob (F-statistic): 0.174\n", "==============================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "------------------------------------------------------------------------------\n", "Intercept 2.7715 0.175 15.837 0.000 2.427 3.116\n", "day[T.Fri] -0.0367 0.361 -0.102 0.919 -0.748 0.675\n", "day[T.Sat] 0.2217 0.229 0.968 0.334 -0.229 0.673\n", "day[T.Sun] 0.4837 0.236 2.051 0.041 0.019 0.948\n", "==============================================================================\n", "\n", "Notes:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "\"\"\"" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Fit the ANOVA style model\n", "anova = smf.ols('tip ~ day', data=tips).fit()\n", "anova.summary(slim=True)" ] }, { "cell_type": "markdown", "id": "c6b4e472-c2a7-4f56-b5b7-5c6388d1f50b", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "Note that the $R^2$ equals the eta-squared effect size (`n2`) from the ANOVA, and the F statistic and p-value are the same too!\n", "\n", "The coefficients are interesting here. You can see they are labelled Fri, Sat, and Sun. Where is Thursday? Much like in the t-test example, *Thursday is absorbed into the intercept*. \n", "Each coefficient represents the difference between that day's mean tip and the Thursday mean (the intercept). When all the day indicators are zero (i.e., the tip was not given on Friday, Saturday, or Sunday) the model returns the mean of Thursday's tips. 
Again we will focus on our model's capabilities to predict things as we continue, but in these instances it is good to know what the coefficients mean before things get too complex.\n" ] }, { "cell_type": "markdown", "id": "b505ace7-0c00-44a2-8260-228b92b724ed", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "### The design matrix\n", "This is a technical aside, but it can be useful for building understanding as we move forward. Behind every model you build in `statsmodels`, there is something called the *design matrix*. The data you provide isn't always the data that the model is fitted to; it is first transformed into a form that OLS can use. For example, you can't do statistics on strings ('Thursday', 'Female', etc) - a transform turns them into a numeric version. This is sometimes referred to as 'dummy coding', where a categorical variable is spread across several columns of ones and zeros: one level is taken as the reference and absorbed into the intercept, and every other level gets its own column, with a '1' on the rows where that level occurs and a zero elsewhere. \n", "\n", "An example will help. Consider the following dataframe that has some simple categorical-style data." ] }, { "cell_type": "code", "execution_count": 25, "id": "93484d38-ae2a-41b0-9637-d9b3d3dae20f", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
categories
0Wales
1Wales
2England
3Scotland
4Ireland
\n", "
" ], "text/plain": [ " categories\n", "0 Wales\n", "1 Wales\n", "2 England\n", "3 Scotland\n", "4 Ireland" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Create a small categorical dataset\n", "df = pd.DataFrame({'categories': ['Wales', 'Wales', 'England', \n", " 'Scotland', 'Ireland']})\n", "display(df)" ] }, { "cell_type": "markdown", "id": "ee79dae4-c14c-4a16-ae35-768477118575", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "We could imagine this is a predictor in a regression. In its current form, it won't work. However, we can import something called `patsy`, a package that does the translating of datasets for us for our models. It has a function called the `dmatrix` which will turn this data into what a model will see, that is, the design matrix. It works in much the same way as `smf.ols` does. Lets import it and demonstrate:" ] }, { "cell_type": "code", "execution_count": 26, "id": "cb0a0518-ca67-496b-82e2-645884c8b353", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Interceptcategories[T.Ireland]categories[T.Scotland]categories[T.Wales]
01.00.00.01.0
11.00.00.01.0
21.00.00.00.0
31.00.01.00.0
41.01.00.00.0
\n", "
" ], "text/plain": [ " Intercept categories[T.Ireland] categories[T.Scotland] \\\n", "0 1.0 0.0 0.0 \n", "1 1.0 0.0 0.0 \n", "2 1.0 0.0 0.0 \n", "3 1.0 0.0 1.0 \n", "4 1.0 1.0 0.0 \n", "\n", " categories[T.Wales] \n", "0 1.0 \n", "1 1.0 \n", "2 0.0 \n", "3 0.0 \n", "4 0.0 " ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Import just the design matrix\n", "from patsy import dmatrix\n", "dmatrix('~ categories', data=df, return_type='dataframe') \n", "# notice there's no DV but the formula is the same!" ] }, { "cell_type": "markdown", "id": "bbd30423-96c2-445e-a069-b8745c2fb953", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "You can see that it has chosen England to become the intercept (this is chosen alphabetically), and each column has a '1' where the title of that column exists in the original data. This representation is what Python uses to build our models. This is not essential to know but it will deepen your understanding." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.2" } }, "nbformat": 4, "nbformat_minor": 5 }